Data engineering¶

This notebook describes the process to download and prepare United States presidential election data. This notebook addresses missing values, reformats data types, and restructures the format of the data tables.

To download and prepare the election data, ArcPy, ArcGIS API for Python, and pandas modules are used.

In [ ]:
 
Import needed modules¶
In [3]:
import arcgis
import pandas as pd
import os
import arcpy

Load and prepare election data¶

When reading the CSV into a pandas data frame, it is necessary to ensure that the county_fips field is read as an object, rather than a number, to prevent deletion of the leading 0 character used by some counties.

Read data into Python¶
In [4]:
# The CSV file is in the same folder as this notebook; providing the CSV file name is sufficient.
table_csv_path = r"C:\LocalData\sp991\GEOM183\Practical1\DataEngineering_and_Visualization\countypres_2000-2020.csv"
elections_complete_df = pd.read_csv(table_csv_path, dtype = {'county_fips':str})
elections_complete_df.head()
Out[4]:
year state state_po county_name county_fips office candidate party candidatevotes totalvotes version mode
0 2000 ALABAMA AL AUTAUGA 01001 US PRESIDENT AL GORE DEMOCRAT 4942 17208 20220315 TOTAL
1 2000 ALABAMA AL AUTAUGA 01001 US PRESIDENT GEORGE W. BUSH REPUBLICAN 11993 17208 20220315 TOTAL
2 2000 ALABAMA AL AUTAUGA 01001 US PRESIDENT RALPH NADER GREEN 160 17208 20220315 TOTAL
3 2000 ALABAMA AL AUTAUGA 01001 US PRESIDENT OTHER OTHER 113 17208 20220315 TOTAL
4 2000 ALABAMA AL BALDWIN 01003 US PRESIDENT AL GORE DEMOCRAT 13997 56480 20220315 TOTAL

Some of the table headings could be improved to better identify the data in the column. You will rename these fields and; then use the pandas head function to return the first five rows of the table to confirm that the new headings are correct.

In [5]:
rename_cols = {
    "state_po": "state_abbr",
    "county_fips": "FIPS",
    "party": "pol_identity"
}
elections_complete_df.rename(columns=rename_cols, inplace=True)
elections_complete_df.head()
Out[5]:
year state state_abbr county_name FIPS office candidate pol_identity candidatevotes totalvotes version mode
0 2000 ALABAMA AL AUTAUGA 01001 US PRESIDENT AL GORE DEMOCRAT 4942 17208 20220315 TOTAL
1 2000 ALABAMA AL AUTAUGA 01001 US PRESIDENT GEORGE W. BUSH REPUBLICAN 11993 17208 20220315 TOTAL
2 2000 ALABAMA AL AUTAUGA 01001 US PRESIDENT RALPH NADER GREEN 160 17208 20220315 TOTAL
3 2000 ALABAMA AL AUTAUGA 01001 US PRESIDENT OTHER OTHER 113 17208 20220315 TOTAL
4 2000 ALABAMA AL BALDWIN 01003 US PRESIDENT AL GORE DEMOCRAT 13997 56480 20220315 TOTAL

The election data includes records that are missing data in the FIPS field. This missing data is referred to as null values. You will later join this table to voting-age population data based on the FIPS value, so this value cannot be null. You will identify how many rows have null values and establish a strategy for dealing with those null values.

In [6]:
# Find records with null values.
elections_complete_df.query("FIPS.isnull()")
Out[6]:
year state state_abbr county_name FIPS office candidate pol_identity candidatevotes totalvotes version mode
12452 2000 CONNECTICUT CT STATEWIDE WRITEIN NaN US PRESIDENT AL GORE DEMOCRAT 0 0 20220315 TOTAL
12453 2000 MAINE ME MAINE UOCAVA NaN US PRESIDENT AL GORE DEMOCRAT 0 0 20220315 TOTAL
12454 2000 RHODE ISLAND RI FEDERAL PRECINCT NaN US PRESIDENT AL GORE DEMOCRAT 0 0 20220315 TOTAL
12455 2000 CONNECTICUT CT STATEWIDE WRITEIN NaN US PRESIDENT GEORGE W. BUSH REPUBLICAN 0 0 20220315 TOTAL
12456 2000 MAINE ME MAINE UOCAVA NaN US PRESIDENT GEORGE W. BUSH REPUBLICAN 0 0 20220315 TOTAL
12457 2000 RHODE ISLAND RI FEDERAL PRECINCT NaN US PRESIDENT GEORGE W. BUSH REPUBLICAN 0 0 20220315 TOTAL
12458 2000 CONNECTICUT CT STATEWIDE WRITEIN NaN US PRESIDENT RALPH NADER GREEN 0 0 20220315 TOTAL
12459 2000 MAINE ME MAINE UOCAVA NaN US PRESIDENT RALPH NADER GREEN 0 0 20220315 TOTAL
12460 2000 RHODE ISLAND RI FEDERAL PRECINCT NaN US PRESIDENT RALPH NADER GREEN 0 0 20220315 TOTAL
12461 2000 CONNECTICUT CT STATEWIDE WRITEIN NaN US PRESIDENT OTHER OTHER 0 0 20220315 TOTAL
12462 2000 MAINE ME MAINE UOCAVA NaN US PRESIDENT OTHER OTHER 0 0 20220315 TOTAL
12463 2000 RHODE ISLAND RI FEDERAL PRECINCT NaN US PRESIDENT OTHER OTHER 0 0 20220315 TOTAL
21806 2004 CONNECTICUT CT STATEWIDE WRITEIN NaN US PRESIDENT JOHN KERRY DEMOCRAT 0 0 20220315 TOTAL
21807 2004 MAINE ME MAINE UOCAVA NaN US PRESIDENT JOHN KERRY DEMOCRAT 0 0 20220315 TOTAL
21808 2004 RHODE ISLAND RI FEDERAL PRECINCT NaN US PRESIDENT JOHN KERRY DEMOCRAT 0 0 20220315 TOTAL
21809 2004 CONNECTICUT CT STATEWIDE WRITEIN NaN US PRESIDENT GEORGE W. BUSH REPUBLICAN 0 0 20220315 TOTAL
21810 2004 MAINE ME MAINE UOCAVA NaN US PRESIDENT GEORGE W. BUSH REPUBLICAN 0 0 20220315 TOTAL
21811 2004 RHODE ISLAND RI FEDERAL PRECINCT NaN US PRESIDENT GEORGE W. BUSH REPUBLICAN 0 0 20220315 TOTAL
21812 2004 CONNECTICUT CT STATEWIDE WRITEIN NaN US PRESIDENT OTHER OTHER 0 0 20220315 TOTAL
21813 2004 MAINE ME MAINE UOCAVA NaN US PRESIDENT OTHER OTHER 0 0 20220315 TOTAL
21814 2004 RHODE ISLAND RI FEDERAL PRECINCT NaN US PRESIDENT OTHER OTHER 0 0 20220315 TOTAL
31157 2008 CONNECTICUT CT STATEWIDE WRITEIN NaN US PRESIDENT BARACK OBAMA DEMOCRAT 0 0 20220315 TOTAL
31158 2008 MAINE ME MAINE UOCAVA NaN US PRESIDENT BARACK OBAMA DEMOCRAT 0 0 20220315 TOTAL
31159 2008 RHODE ISLAND RI FEDERAL PRECINCT NaN US PRESIDENT BARACK OBAMA DEMOCRAT 0 0 20220315 TOTAL
31160 2008 CONNECTICUT CT STATEWIDE WRITEIN NaN US PRESIDENT JOHN MCCAIN REPUBLICAN 0 0 20220315 TOTAL
31161 2008 MAINE ME MAINE UOCAVA NaN US PRESIDENT JOHN MCCAIN REPUBLICAN 0 0 20220315 TOTAL
31162 2008 RHODE ISLAND RI FEDERAL PRECINCT NaN US PRESIDENT JOHN MCCAIN REPUBLICAN 0 0 20220315 TOTAL
31163 2008 CONNECTICUT CT STATEWIDE WRITEIN NaN US PRESIDENT OTHER OTHER 0 0 20220315 TOTAL
31164 2008 MAINE ME MAINE UOCAVA NaN US PRESIDENT OTHER OTHER 0 0 20220315 TOTAL
31165 2008 RHODE ISLAND RI FEDERAL PRECINCT NaN US PRESIDENT OTHER OTHER 0 0 20220315 TOTAL
40508 2012 CONNECTICUT CT STATEWIDE WRITEIN NaN US PRESIDENT BARACK OBAMA DEMOCRAT 0 918 20220315 TOTAL
40509 2012 MAINE ME MAINE UOCAVA NaN US PRESIDENT BARACK OBAMA DEMOCRAT 2071 3054 20220315 TOTAL
40510 2012 RHODE ISLAND RI FEDERAL PRECINCT NaN US PRESIDENT BARACK OBAMA DEMOCRAT 268 333 20220315 TOTAL
40511 2012 CONNECTICUT CT STATEWIDE WRITEIN NaN US PRESIDENT MITT ROMNEY REPUBLICAN 0 918 20220315 TOTAL
40512 2012 MAINE ME MAINE UOCAVA NaN US PRESIDENT MITT ROMNEY REPUBLICAN 858 3054 20220315 TOTAL
40513 2012 RHODE ISLAND RI FEDERAL PRECINCT NaN US PRESIDENT MITT ROMNEY REPUBLICAN 53 333 20220315 TOTAL
40514 2012 CONNECTICUT CT STATEWIDE WRITEIN NaN US PRESIDENT OTHER OTHER 918 918 20220315 TOTAL
40515 2012 MAINE ME MAINE UOCAVA NaN US PRESIDENT OTHER OTHER 125 3054 20220315 TOTAL
40516 2012 RHODE ISLAND RI FEDERAL PRECINCT NaN US PRESIDENT OTHER OTHER 12 333 20220315 TOTAL
49859 2016 CONNECTICUT CT STATEWIDE WRITEIN NaN US PRESIDENT HILLARY CLINTON DEMOCRAT 0 2616 20220315 TOTAL
49860 2016 MAINE ME MAINE UOCAVA NaN US PRESIDENT HILLARY CLINTON DEMOCRAT 3017 3986 20220315 TOTAL
49861 2016 RHODE ISLAND RI FEDERAL PRECINCT NaN US PRESIDENT HILLARY CLINTON DEMOCRAT 637 728 20220315 TOTAL
49862 2016 CONNECTICUT CT STATEWIDE WRITEIN NaN US PRESIDENT DONALD TRUMP REPUBLICAN 0 2616 20220315 TOTAL
49863 2016 MAINE ME MAINE UOCAVA NaN US PRESIDENT DONALD TRUMP REPUBLICAN 648 3986 20220315 TOTAL
49864 2016 RHODE ISLAND RI FEDERAL PRECINCT NaN US PRESIDENT DONALD TRUMP REPUBLICAN 53 728 20220315 TOTAL
49865 2016 CONNECTICUT CT STATEWIDE WRITEIN NaN US PRESIDENT OTHER OTHER 2616 2616 20220315 TOTAL
49866 2016 MAINE ME MAINE UOCAVA NaN US PRESIDENT OTHER OTHER 321 3986 20220315 TOTAL
49867 2016 RHODE ISLAND RI FEDERAL PRECINCT NaN US PRESIDENT OTHER OTHER 38 728 20220315 TOTAL
53317 2020 DISTRICT OF COLUMBIA DC DISTRICT OF COLUMBIA NaN US PRESIDENT JOSEPH R BIDEN JR DEMOCRAT 317323 344356 20220315 TOTAL
53318 2020 DISTRICT OF COLUMBIA DC DISTRICT OF COLUMBIA NaN US PRESIDENT OTHER GREEN 1726 344356 20220315 TOTAL
53319 2020 DISTRICT OF COLUMBIA DC DISTRICT OF COLUMBIA NaN US PRESIDENT JO JORGENSEN LIBERTARIAN 2036 344356 20220315 TOTAL
53320 2020 DISTRICT OF COLUMBIA DC DISTRICT OF COLUMBIA NaN US PRESIDENT OTHER OTHER 4685 344356 20220315 TOTAL
53321 2020 DISTRICT OF COLUMBIA DC DISTRICT OF COLUMBIA NaN US PRESIDENT DONALD J TRUMP REPUBLICAN 18586 344356 20220315 TOTAL
63983 2020 RHODE ISLAND RI FEDERAL PRECINCT NaN US PRESIDENT JOSEPH R BIDEN JR DEMOCRAT 1276 1374 20220315 TOTAL
63984 2020 RHODE ISLAND RI FEDERAL PRECINCT NaN US PRESIDENT JO JORGENSEN LIBERTARIAN 6 1374 20220315 TOTAL
63985 2020 RHODE ISLAND RI FEDERAL PRECINCT NaN US PRESIDENT OTHER OTHER 7 1374 20220315 TOTAL
63986 2020 RHODE ISLAND RI FEDERAL PRECINCT NaN US PRESIDENT DONALD J TRUMP REPUBLICAN 85 1374 20220315 TOTAL

After reviewing the records, you notice that some of the records represent votes cast in Washington, DC, for the 2020 election. Removing these records would affect hundreds of thousands of votes and make it impossible to include Washington, DC, in the analysis for 2020. Additional research indicates that the appropriate FIPS code for Washington, DC, is 11001. You will add the correct FIPS code for Washington, DC, records.

In [7]:
# Add the FIPS code for Washington, DC.
elections_complete_df.loc[elections_complete_df['state_abbr'] == 'DC', 'FIPS'] = '11001'

The remaining records with null FIPS values are votes assigned at a state-wide level, not to any county. Your analysis will take place at the county level, which means that these records cannot be used. You will assess the effect that removing these votes will have on your overall analysis.

In [8]:
# Determine the total number of records in the table.
rowcount = elections_complete_df.shape[0]

# Determine the total number of votes cast across all counties in all elections.
votecount = elections_complete_df["candidatevotes"].sum()

# Determine how many votes are associated with null FIPS values.
null_fips_df = elections_complete_df.query("FIPS.isnull()")
null_fips_rowcount = null_fips_df.shape[0]
null_fips_votecount = null_fips_df["candidatevotes"].sum()

# Calculate the percentage of the data affected by null records.
percentage_null_fips_rows = round((null_fips_rowcount / rowcount) * 100, 3)
percentage_null_fips_votes = round((null_fips_votecount / votecount) * 100, 3)

# Use a print statement to report this information.
print(f"There were {null_fips_rowcount} records with null FIPS values in the data, representing {null_fips_votecount} votes.")
print(f"This amounts to {percentage_null_fips_rows}% of the total records and {percentage_null_fips_votes}% of the total votes.")
There were 52 records with null FIPS values in the data, representing 13009 votes.
This amounts to 0.072% of the total records and 0.002% of the total votes.

The records with null FIPS values represent a small percentage of the total, and they are unlikely to change the overall analysis. You will create a new dataframe that includes only records with non-null FIPS values.

In [9]:
# Remove records with null FIPS values.
elections_df = elections_complete_df.query("FIPS.notnull()")

Even after correcting for the missing FIPS values, the format of the election data table will prevent a proper join to the voting-age population data. Currently, each record in the table corresponds to a candidate's votes for each county for each vote mode. You need to reformat the table so that each record corresponds to a single county for each election year, with fields showing the total votes for each major party candidate for that election year in that county. It is possible to do this using the Pivot Table geoprocessing tool or Excel pivot tables, but Python may make it easier to automate and share.

The steps in restructuring the table are as follows:

  1. Limit records to only candidates who identify as either a Democrat or a Republican
  2. Sum each candidate's votes for every mode in each county, for each election year
  3. Lock some fields in place, and then pivot the table using the party field.
  4. Reformat the pivoted table.

The following code cell performs these steps. reformat_table.gif

In [10]:
# Filter records by political identity for Democrats and Republicans.
elections_df = elections_df.query("pol_identity in ['DEMOCRAT', 'REPUBLICAN']")
In [11]:
# Group records for each candidate by county and year.
candidate_group = elections_df.groupby(['FIPS', 'county_name', 'state', 'candidate', 'year', 'pol_identity', 'totalvotes'])

# Sum each candidate's votes in each county, for each election year.
candidate_votes = candidate_group['candidatevotes'].sum()

# Remove the multi-index created by the group-by and sum, returning a new data frame.
candidate_votes_df = candidate_votes.reset_index()

# Check the output.
candidate_votes_df.head()
Out[11]:
FIPS county_name state candidate year pol_identity totalvotes candidatevotes
0 01001 AUTAUGA ALABAMA AL GORE 2000 DEMOCRAT 17208 4942
1 01001 AUTAUGA ALABAMA BARACK OBAMA 2008 DEMOCRAT 23641 6093
2 01001 AUTAUGA ALABAMA BARACK OBAMA 2012 DEMOCRAT 23932 6363
3 01001 AUTAUGA ALABAMA DONALD J TRUMP 2020 REPUBLICAN 27770 19838
4 01001 AUTAUGA ALABAMA DONALD TRUMP 2016 REPUBLICAN 24973 18172
In [12]:
# Group records for each candidate by county and year.
candidate_group = elections_df.groupby(['FIPS', 'county_name', 'state', 'candidate', 'year', 'pol_identity', 'totalvotes'])

# Sum each candidate's votes in each county, for each election year.
candidate_votes = candidate_group['candidatevotes'].sum()

# Remove the multi-index created by the group-by and sum, returning a new data frame.
candidate_votes_df = candidate_votes.reset_index()

# Check the output.
candidate_votes_df.head()
Out[12]:
candidatevotes
pol_identity DEMOCRAT REPUBLICAN
year FIPS county_name state totalvotes
2000 01001 AUTAUGA ALABAMA 17208 4942 11993
01003 BALDWIN ALABAMA 56480 13997 40872
01005 BARBOUR ALABAMA 10395 5188 5096
01007 BIBB ALABAMA 7101 2710 4273
01009 BLOUNT ALABAMA 17973 4977 12667
In [13]:
# Remove the multi-index since you no longer need these fields to be "locked" for the pivot.
elections_pivot_df.columns = elections_pivot_df.columns.get_level_values(1).rename(None)
elections_pivot_df.reset_index(inplace=True)

# Rename columns to better reflect their new meaning.
elections_pivot_df.rename(columns={"DEMOCRAT": "votes_dem","REPUBLICAN": "votes_gop"}, inplace=True)

# Check the output.
elections_pivot_df.head()
Out[13]:
year FIPS county_name state totalvotes votes_dem votes_gop
0 2000 01001 AUTAUGA ALABAMA 17208 4942 11993
1 2000 01003 BALDWIN ALABAMA 56480 13997 40872
2 2000 01005 BARBOUR ALABAMA 10395 5188 5096
3 2000 01007 BIBB ALABAMA 7101 2710 4273
4 2000 01009 BLOUNT ALABAMA 17973 4977 12667

Pandas has three powerful capabilities that helped perform this operation:

  • The ability to perform an operation using a "groupby" function
  • The ability to set an index using multiple fields, which acts as the "locking" mechanism
  • The ability to unstack (or pivot) a table

Load and prepare voting-age population data¶

The data includes multiple CSV files, each with estimates of the number of citizens voting in each county for a particular time period. You will load each CSV file into a data frame.

A dictionary is a good data structure to use to keep track of multiple related datasets.

In [14]:
# Create a dictionary of CSV files for each election year.
cvap_paths = {
    2020: r"C:\LocalData\sp991\GEOM183\Practical1\DataEngineering_and_Visualization\CountyCVAP_2017-2021.csv",
    2016: r"C:\LocalData\sp991\GEOM183\Practical1\DataEngineering_and_Visualization\CountyCVAP_2014-2018.csv",
    2012: r"C:\LocalData\sp991\GEOM183\Practical1\DataEngineering_and_Visualization\CountyCVAP_2010-2014.csv",
    2008: r"C:\LocalData\sp991\GEOM183\Practical1\DataEngineering_and_Visualization\CountyCVAP_2006-2010.csv"
}

# Create a dictionary of data frames for each election year.
cvap_dfs = {year:pd.read_csv(path, encoding="latin-1") for year, path in cvap_paths.items()}

# Display the first three rows of each data frame to check the output.
for df in cvap_dfs.values():
    display(df.head(3))
geoname lntitle geoid lnnumber tot_est tot_moe adu_est adu_moe cit_est cit_moe cvap_est cvap_moe
0 Autauga County, Alabama Total 0500000US01001 1 58240 0 44440 122 57625 310 43905 278
1 Autauga County, Alabama Not Hispanic or Latino 0500000US01001 2 56465 0 43255 123 56100 258 42965 223
2 Autauga County, Alabama American Indian or Alaska Native Alone 0500000US01001 3 100 86 100 86 100 86 100 86
geoname lntitle geoid lnnumber tot_est tot_moe adu_est adu_moe cit_est cit_moe cvap_est cvap_moe
0 Autauga County, Alabama Total 05000US01001 1 55200 0 41830 33 54665 173 41305 174
1 Autauga County, Alabama Not Hispanic or Latino 05000US01001 2 53670 0 40835 32 53395 123 40565 125
2 Autauga County, Alabama American Indian or Alaska Native Alone 05000US01001 3 160 68 85 46 160 68 85 46
GEONAME LNTITLE GEOID LNNUMBER TOT_EST TOT_MOE ADU_EST ADU_MOE CIT_EST CIT_MOE CVAP_EST CVAP_MOE
0 Autauga County, Alabama Total 05000US01001 1 55135 NaN 40920 NaN 54675 198 40545 160
1 Autauga County, Alabama Not Hispanic or Latino 05000US01001 2 53745 NaN 40055 NaN 53510 152 39885 115
2 Autauga County, Alabama American Indian or Alaska Native Alone 05000US01001 3 235 64.0 190 55.0 235 64 190 55
GEONAME LNTITLE GEOID LNNUMBER TOT_EST TOT_MOE ADU_EST ADU_MOE CIT_EST CIT_MOE CVAP_EST CVAP_MOE
0 Autauga County, Alabama Total 05000US01001 1 53155 NaN 38430 NaN 52540 250 38010 158
1 Autauga County, Alabama Not Hispanic or Latino 05000US01001 2 51920 NaN 37685 NaN 51610 148 37480 113
2 Autauga County, Alabama American Indian or Alaska Native Alone 05000US01001 3 160 82.0 135 72.0 160 82 135 72

The four separate CSV files resulted in four separate data frames, with some differences in the column headings. For this analysis, you will want a single table with estimates for total number of citizens of voting age for each year. You will also need a FIPS value to join this table to the election data above and the geoname to spotcheck data. To accomplish these goals, you will perform these steps:

  1. Change all column headings to lowercase formatting so that each data frame has the exact same structure.
  2. Add a year value to each of the data frames to differentiate their values.
  3. Include only total count rows.
  4. Remove any columns other than geoid, cvap_est, and year.
  5. Extract the FIPS value from the geoid value.
  6. Concatenate the data frames together.
In [15]:
# Create a new dictionary to hold the processed dataframes.
cvap_processed_dfs = {}

for year, df in cvap_dfs.items():
    
    # Set column formatting to lowercase.
    df.columns = df.columns.str.lower()
    
    # Add a year value.
    df['year'] = year
    
    # Include only rows of total counts.
    df = df.query("lntitle == 'Total'")

    # Include only necessary columns.
    df = df[['year', 'geoid','geoname','cvap_est']]

    # Extract FIPS from the geoid value.
    df['FIPS'] = df['geoid'].str[-5:]
    
    # Add processed data frames to the dictionary.
    cvap_processed_dfs[year] = df
    
# Concatenate the data frames.
cvap_df = pd.concat(cvap_processed_dfs.values())

# Check the output.
cvap_df
Out[15]:
year geoid geoname cvap_est FIPS
0 2020 0500000US01001 Autauga County, Alabama 43905 01001
13 2020 0500000US01003 Baldwin County, Alabama 174685 01003
26 2020 0500000US01005 Barbour County, Alabama 19625 01005
39 2020 0500000US01007 Bibb County, Alabama 17680 01007
52 2020 0500000US01009 Blount County, Alabama 43635 01009
... ... ... ... ... ...
41808 2008 05000US72145 Vega Baja Municipio, Puerto Rico 44270 72145
41821 2008 05000US72147 Vieques Municipio, Puerto Rico 6970 72147
41834 2008 05000US72149 Villalba Municipio, Puerto Rico 18830 72149
41847 2008 05000US72151 Yabucoa Municipio, Puerto Rico 28295 72151
41860 2008 05000US72153 Yauco Municipio, Puerto Rico 32120 72153

12882 rows × 5 columns

Join tables¶

Now that both the voting-age population (cvap_df) and the election data (elections_pivot_df) tables have been prepared, they can be joined together. The combination of year and FIPS values for each county identifies a unique record, so these fields will be used to join the tables.

In [16]:
voting_info_df = pd.merge(elections_pivot_df, cvap_df, left_on=['FIPS', 'year'], right_on=['FIPS', 'year'], how='left')

# Check the output.
voting_info_df
Out[16]:
year FIPS county_name state totalvotes votes_dem votes_gop geoid geoname cvap_est
0 2000 01001 AUTAUGA ALABAMA 17208 4942 11993 NaN NaN NaN
1 2000 01003 BALDWIN ALABAMA 56480 13997 40872 NaN NaN NaN
2 2000 01005 BARBOUR ALABAMA 10395 5188 5096 NaN NaN NaN
3 2000 01007 BIBB ALABAMA 7101 2710 4273 NaN NaN NaN
4 2000 01009 BLOUNT ALABAMA 17973 4977 12667 NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ...
18923 2020 56037 SWEETWATER WYOMING 16698 3823 12229 0500000US56037 Sweetwater County, Wyoming 30055.0
18924 2020 56039 TETON WYOMING 14787 9848 4341 0500000US56039 Teton County, Wyoming 17275.0
18925 2020 56041 UINTA WYOMING 9459 1591 7496 0500000US56041 Uinta County, Wyoming 14205.0
18926 2020 56043 WASHAKIE WYOMING 4032 651 3245 0500000US56043 Washakie County, Wyoming 5995.0
18927 2020 56045 WESTON WYOMING 3560 360 3107 0500000US56045 Weston County, Wyoming 5415.0

18928 rows × 10 columns

You notice that the missing values from the voting-age population table are for the year 2000. The earliest voting-age population data that is available starts at the 2008 election. You will remove election data prior to 2008.

In [17]:
voting_info_df = voting_info_df.query("year >= 2008")

# Check the output.
voting_info_df
Out[17]:
year FIPS county_name state totalvotes votes_dem votes_gop geoid geoname cvap_est
6309 2008 01001 AUTAUGA ALABAMA 23641 6093 17403 05000US01001 Autauga County, Alabama 38010.0
6310 2008 01003 BALDWIN ALABAMA 81413 19386 61271 05000US01003 Baldwin County, Alabama 130865.0
6311 2008 01005 BARBOUR ALABAMA 11630 5697 5866 05000US01005 Barbour County, Alabama 20900.0
6312 2008 01007 BIBB ALABAMA 8644 2299 6262 05000US01007 Bibb County, Alabama 17270.0
6313 2008 01009 BLOUNT ALABAMA 24267 3522 20389 05000US01009 Blount County, Alabama 40605.0
... ... ... ... ... ... ... ... ... ... ...
18923 2020 56037 SWEETWATER WYOMING 16698 3823 12229 0500000US56037 Sweetwater County, Wyoming 30055.0
18924 2020 56039 TETON WYOMING 14787 9848 4341 0500000US56039 Teton County, Wyoming 17275.0
18925 2020 56041 UINTA WYOMING 9459 1591 7496 0500000US56041 Uinta County, Wyoming 14205.0
18926 2020 56043 WASHAKIE WYOMING 4032 651 3245 0500000US56043 Washakie County, Wyoming 5995.0
18927 2020 56045 WESTON WYOMING 3560 360 3107 0500000US56045 Weston County, Wyoming 5415.0

12619 rows × 10 columns

The records at the top and bottom of the data frame now look correct, but you will check whether there are any other null values for the cvap_est field.

In [18]:
voting_info_df.query("cvap_est.isnull()")
Out[18]:
year FIPS county_name state totalvotes votes_dem votes_gop geoid geoname cvap_est
6376 2008 02001 DISTRICT 1 ALASKA 6970 2597 4149 NaN NaN NaN
6377 2008 02002 DISTRICT 2 ALASKA 7735 3468 4029 NaN NaN NaN
6378 2008 02003 DISTRICT 3 ALASKA 8767 5657 2829 NaN NaN NaN
6379 2008 02004 DISTRICT 4 ALASKA 8736 4161 4302 NaN NaN NaN
6380 2008 02005 DISTRICT 5 ALASKA 7123 3393 3426 NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ...
15878 2020 02038 DISTRICT 38 ALASKA 5299 3202 1737 NaN NaN NaN
15879 2020 02039 DISTRICT 39 ALASKA 5891 3580 1939 NaN NaN NaN
15880 2020 02040 DISTRICT 40 ALASKA 4603 2318 1994 NaN NaN NaN
15881 2020 02099 DISTRICT 99 ALASKA 433 373 59 NaN NaN NaN
17613 2020 36000 KANSAS CITY MISSOURI 136645 107660 26393 NaN NaN NaN

158 rows × 10 columns

The election records for Alaska appear unusual because they are referencing a district rather than a county. Additional research shows that Alaska has a unique government model that does not use counties. The election data is reported for regions that do not correspond to the counties in the voting-age population table.

Because of this data incompatibility, you will remove records from the state of Alaska.

In [19]:
voting_info_df.query("cvap_est.isnull()")

Next, you will query the data frame for any remaining records with null cvap_est values.

In [20]:
voting_info_df.query("cvap_est.isnull()")
Out[20]:
year FIPS county_name state totalvotes votes_dem votes_gop geoid geoname cvap_est
8148 2008 36000 KANSAS CITY MISSOURI 153219 120102 31854 NaN NaN NaN
11303 2012 36000 KANSAS CITY MISSOURI 136802 105670 29509 NaN NaN NaN
12392 2012 51515 BEDFORD VIRGINIA 2805 1225 1527 NaN NaN NaN
14458 2016 36000 KANSAS CITY MISSOURI 128601 97735 24654 NaN NaN NaN
15547 2016 51515 BEDFORD VIRGINIA 0 0 0 NaN NaN NaN
17613 2020 36000 KANSAS CITY MISSOURI 136645 107660 26393 NaN NaN NaN

You notice two more issues with the data. Additional research explains each of these issues:

  • Kansas City, MO (FIPS 36000), reports its election data separately from Jackson County, MO (FIPS 29095).
  • The independent city of Bedford, VA (FIPS 51515), merged with Bedford County, VA (FIPS 51019), in 2013. The voting-age population data used for the 2012 and 2016 elections was published after the merger and does not include separate data for the city of Bedford.

You can verify the research by querying the records of all four FIPS codes.

In [21]:
voting_info_df.query("FIPS in ['29095', '36000', '51019', '51515']")
Out[21]:
year FIPS county_name state totalvotes votes_dem votes_gop geoid geoname cvap_est
7850 2008 29095 JACKSON MISSOURI 186047 90722 92833 05000US29095 Jackson County, Missouri 481045.0
8148 2008 36000 KANSAS CITY MISSOURI 153219 120102 31854 NaN NaN NaN
9150 2008 51019 BEDFORD VIRGINIA 35830 11017 24420 05000US51019 Bedford County, Virginia 51755.0
9237 2008 51515 BEDFORD VIRGINIA 2734 1208 1497 05000US51515 Bedford city, Virginia 4595.0
11005 2012 29095 JACKSON MISSOURI 174764 78283 93199 05000US29095 Jackson County, Missouri 493440.0
11303 2012 36000 KANSAS CITY MISSOURI 136802 105670 29509 NaN NaN NaN
12305 2012 51019 BEDFORD VIRGINIA 37425 10209 26679 05000US51019 Bedford County, Virginia 58850.0
12392 2012 51515 BEDFORD VIRGINIA 2805 1225 1527 NaN NaN NaN
14160 2016 29095 JACKSON MISSOURI 173275 71237 91557 05000US29095 Jackson County, Missouri 506340.0
14458 2016 36000 KANSAS CITY MISSOURI 128601 97735 24654 NaN NaN NaN
15460 2016 51019 BEDFORD VIRGINIA 42525 9768 30659 05000US51019 Bedford County, Virginia 61205.0
15547 2016 51515 BEDFORD VIRGINIA 0 0 0 NaN NaN NaN
17315 2020 29095 JACKSON MISSOURI 196418 92182 100142 0500000US29095 Jackson County, Missouri 523040.0
17613 2020 36000 KANSAS CITY MISSOURI 136645 107660 26393 NaN NaN NaN
18615 2020 51019 BEDFORD VIRGINIA 48669 12176 35600 0500000US51019 Bedford County, Virginia 62435.0

To fix these issues, you will merge the data for each city into its respective county. The Pandas groupby function can pair each county with its respective city for each year and then sum the values of each pair together.

In [22]:
# Because the counties to fix are in different states, each group of a county and its associated city can be defined by year and state.
county_groups = voting_info_df.query("FIPS in ['29095', '36000', '51019', '51515']").groupby(['year', 'state'])
summed_votes = county_groups.sum()

# Match index values for summed data with original data.
summed_votes.index = voting_info_df.query("FIPS in ['29095', '51019']").sort_values('year').index

# Check the output.
summed_votes
Out[22]:
totalvotes votes_dem votes_gop cvap_est
7850 339266 210824 124687 481045.0
9150 38564 12225 25917 56350.0
11005 311566 183953 122708 493440.0
12305 40230 11434 28206 58850.0
14160 301876 168972 116211 506340.0
15460 42525 9768 30659 61205.0
17315 333063 199842 126535 523040.0
18615 48669 12176 35600 62435.0
In [23]:
# Update county records with new summed values.
voting_info_df.loc[summed_votes.index, summed_votes.columns] = summed_votes

# Eliminate Kansas City and Bedford records.
voting_info_df = voting_info_df.query("FIPS not in ['36000', '51515']")

# Check the output.
voting_info_df.query("FIPS in ['29095', '36000', '51019', '51515']")
Out[23]:
year FIPS county_name state totalvotes votes_dem votes_gop geoid geoname cvap_est
7850 2008 29095 JACKSON MISSOURI 339266 210824 124687 05000US29095 Jackson County, Missouri 481045.0
9150 2008 51019 BEDFORD VIRGINIA 38564 12225 25917 05000US51019 Bedford County, Virginia 56350.0
11005 2012 29095 JACKSON MISSOURI 311566 183953 122708 05000US29095 Jackson County, Missouri 493440.0
12305 2012 51019 BEDFORD VIRGINIA 40230 11434 28206 05000US51019 Bedford County, Virginia 58850.0
14160 2016 29095 JACKSON MISSOURI 301876 168972 116211 05000US29095 Jackson County, Missouri 506340.0
15460 2016 51019 BEDFORD VIRGINIA 42525 9768 30659 05000US51019 Bedford County, Virginia 61205.0
17315 2020 29095 JACKSON MISSOURI 333063 199842 126535 0500000US29095 Jackson County, Missouri 523040.0
18615 2020 51019 BEDFORD VIRGINIA 48669 12176 35600 0500000US51019 Bedford County, Virginia 62435.0

The county records now show the correct sum, and the records for Kansas City and the city of Bedford have been removed from the data frame.

In [24]:
# Verify that no record in the output data frame has a null cvap_est value.
voting_info_df.query("cvap_est.isnull()")
Out[24]:
year FIPS county_name state totalvotes votes_dem votes_gop geoid geoname cvap_est

Calculate additional columns¶

You will use the values from the updated table to add additional columns of information, such as the number of voters who did not choose the Democratic or Republican party, the percentage of voters for each party, and so on. Each column is referred to as an attribute of the dataset.

In [25]:
# Calculate voters who did not choose the Democratic or Republican party.
voting_info_df['votes_other'] = voting_info_df['totalvotes'] - (voting_info_df['votes_dem'] + voting_info_df['votes_gop'])

# Calculate voter share attributes.
voting_info_df['voter_share_major_party'] = (voting_info_df['votes_dem'] + voting_info_df['votes_gop']) / voting_info_df['totalvotes']
voting_info_df['voter_share_dem'] = voting_info_df['votes_dem'] / voting_info_df['totalvotes']
voting_info_df['voter_share_gop'] = voting_info_df['votes_gop'] / voting_info_df['totalvotes']
voting_info_df['voter_share_other'] = voting_info_df['votes_other'] / voting_info_df['totalvotes']

# Calculate raw difference attributes.
voting_info_df['rawdiff_dem_vs_gop'] = voting_info_df['votes_dem'] - voting_info_df['votes_gop']
voting_info_df['rawdiff_gop_vs_dem'] = voting_info_df['votes_gop'] - voting_info_df['votes_dem']
voting_info_df['rawdiff_dem_vs_other'] = voting_info_df['votes_dem'] - voting_info_df['votes_other']
voting_info_df['rawdiff_gop_vs_other'] = voting_info_df['votes_gop'] - voting_info_df['votes_other']
voting_info_df['rawdiff_other_vs_dem'] = voting_info_df['votes_other'] - voting_info_df['votes_dem']
voting_info_df['rawdiff_other_vs_gop'] = voting_info_df['votes_other'] - voting_info_df['votes_gop']

# Calculate percentage difference attributes.
voting_info_df['pctdiff_dem_vs_gop'] = (voting_info_df['votes_dem'] - voting_info_df['votes_gop']) / voting_info_df['totalvotes']
voting_info_df['pctdiff_gop_vs_dem'] = (voting_info_df['votes_gop'] - voting_info_df['votes_dem']) / voting_info_df['totalvotes']
voting_info_df['pctdiff_dem_vs_other'] = (voting_info_df['votes_dem'] - voting_info_df['votes_other']) / voting_info_df['totalvotes']
voting_info_df['pctdiff_gop_vs_other'] = (voting_info_df['votes_gop'] - voting_info_df['votes_other']) / voting_info_df['totalvotes']
voting_info_df['pctdiff_other_vs_dem'] = (voting_info_df['votes_other'] - voting_info_df['votes_dem']) / voting_info_df['totalvotes']
voting_info_df['pctdiff_other_vs_gop'] = (voting_info_df['votes_other'] - voting_info_df['votes_gop']) / voting_info_df['totalvotes']

# Calculate voter turnout attributes.
voting_info_df['voter_turnout'] = voting_info_df['totalvotes'] / voting_info_df['cvap_est']
voting_info_df['voter_turnout_majparty'] = (voting_info_df['votes_dem']+voting_info_df['votes_gop']) / voting_info_df['cvap_est']
voting_info_df['voter_turnout_dem'] = voting_info_df['votes_dem'] / voting_info_df['cvap_est']
voting_info_df['voter_turnout_gop'] = voting_info_df['votes_gop'] / voting_info_df['cvap_est']
voting_info_df['voter_turnout_other'] = voting_info_df['votes_other'] / voting_info_df['cvap_est']

# Determine the winning political party.
def return_winning_party(total_votes_dem, total_votes_gop, total_votes_other):
    if total_votes_dem > total_votes_gop and total_votes_dem > total_votes_other:
        return "Democratic Party"
    elif total_votes_gop > total_votes_dem and total_votes_gop > total_votes_other:
        return "Republican Party"
    elif total_votes_other > total_votes_dem and total_votes_other > total_votes_gop:
        return "Other Party"
    
voting_info_df["Winning Party"] = voting_info_df.apply(
    lambda x: return_winning_party(
        x.votes_dem, 
        x.votes_gop,
        x.votes_other
    ), axis=1)

# Check the output.
voting_info_df.head()
Out[25]:
year FIPS county_name state totalvotes votes_dem votes_gop geoid geoname cvap_est votes_other voter_share_major_party voter_share_dem voter_share_gop voter_share_other rawdiff_dem_vs_gop rawdiff_gop_vs_dem rawdiff_dem_vs_other rawdiff_gop_vs_other rawdiff_other_vs_dem rawdiff_other_vs_gop pctdiff_dem_vs_gop pctdiff_gop_vs_dem pctdiff_dem_vs_other pctdiff_gop_vs_other pctdiff_other_vs_dem pctdiff_other_vs_gop voter_turnout voter_turnout_majparty voter_turnout_dem voter_turnout_gop voter_turnout_other Winning Party
6309 2008 01001 AUTAUGA ALABAMA 23641 6093 17403 05000US01001 Autauga County, Alabama 38010.0 145 0.993867 0.257730 0.736136 0.006133 -11310 11310 5948 17258 -5948 -17258 -0.478406 0.478406 0.251597 0.730003 -0.251597 -0.730003 0.621968 0.618153 0.160300 0.457853 0.003815 Republican Party
6310 2008 01003 BALDWIN ALABAMA 81413 19386 61271 05000US01003 Baldwin County, Alabama 130865.0 756 0.990714 0.238119 0.752595 0.009286 -41885 41885 18630 60515 -18630 -60515 -0.514476 0.514476 0.228833 0.743309 -0.228833 -0.743309 0.622114 0.616337 0.148137 0.468200 0.005777 Republican Party
6311 2008 01005 BARBOUR ALABAMA 11630 5697 5866 05000US01005 Barbour County, Alabama 20900.0 67 0.994239 0.489854 0.504385 0.005761 -169 169 5630 5799 -5630 -5799 -0.014531 0.014531 0.484093 0.498624 -0.484093 -0.498624 0.556459 0.553254 0.272584 0.280670 0.003206 Republican Party
6312 2008 01007 BIBB ALABAMA 8644 2299 6262 05000US01007 Bibb County, Alabama 17270.0 83 0.990398 0.265965 0.724433 0.009602 -3963 3963 2216 6179 -2216 -6179 -0.458468 0.458468 0.256363 0.714831 -0.256363 -0.714831 0.500521 0.495715 0.133121 0.362594 0.004806 Republican Party
6313 2008 01009 BLOUNT ALABAMA 24267 3522 20389 05000US01009 Blount County, Alabama 40605.0 356 0.985330 0.145135 0.840195 0.014670 -16867 16867 3166 20033 -3166 -20033 -0.695059 0.695059 0.130465 0.825524 -0.130465 -0.825524 0.597636 0.588868 0.086738 0.502130 0.008767 Republican Party

Validate the data¶

Before continuing with other data preparation, you should confirm that the output data has been successfully created.

First, you will validate the values for voter turnout. Because these values represent a fraction (total votes divided by voting age population), you will confirm that the values range between 0 and 1.

In [26]:
voting_info_df['voter_turnout'].describe()
Out[26]:
count    12448.000000
mean         0.609919
std          0.101349
min          0.168894
25%          0.542592
50%          0.609886
75%          0.674621
max          2.155556
Name: voter_turnout, dtype: float64

The max value in the column is above 1, indicating a voter turnout above 100%. You will further investigate these anomalous records.

In [27]:
# Perform query for voter turnout above 100%.
turnout_over_1_df = voting_info_df.query('voter_turnout > 1')[['FIPS','county_name','state','year','voter_turnout','totalvotes','cvap_est']].sort_values(['FIPS', 'year'])
turnout_over_1_df
Out[27]:
FIPS county_name state year voter_turnout totalvotes cvap_est
6592 08053 HINSDALE COLORADO 2008 1.346067 599 445.0
16070 08079 MINERAL COLORADO 2020 1.103650 756 685.0
12931 08111 SAN JUAN COLORADO 2016 1.065263 506 475.0
16348 16025 CAMAS IDAHO 2020 1.040000 754 725.0
17399 30033 GARFIELD MONTANA 2020 1.069737 813 760.0
7955 30075 POWDER RIVER MONTANA 2008 1.109881 1404 1265.0
17443 31009 BLAINE NEBRASKA 2020 1.056667 317 300.0
8126 35021 HARDING NEW MEXICO 2008 1.061017 626 590.0
11281 35021 HARDING NEW MEXICO 2012 1.102752 601 545.0
14436 35021 HARDING NEW MEXICO 2016 1.351282 527 390.0
17591 35021 HARDING NEW MEXICO 2020 1.160920 505 435.0
8712 46061 HANSON SOUTH DAKOTA 2008 1.056455 2414 2285.0
11867 46061 HANSON SOUTH DAKOTA 2012 1.061978 2416 2275.0
18177 46061 HANSON SOUTH DAKOTA 2020 1.001258 2388 2385.0
18325 48033 BORDEN TEXAS 2020 1.139726 416 365.0
8974 48261 KENEDY TEXAS 2008 1.035897 202 195.0
18439 48261 KENEDY TEXAS 2020 2.155556 194 90.0
18440 48263 KENT TEXAS 2020 1.050000 462 440.0
18443 48269 KING TEXAS 2020 1.060000 159 150.0
8994 48301 LOVING TEXAS 2008 1.975000 79 40.0
15304 48301 LOVING TEXAS 2016 1.083333 65 60.0
18459 48301 LOVING TEXAS 2020 1.320000 66 50.0
15309 48311 MCMULLEN TEXAS 2016 1.050526 499 475.0
18464 48311 MCMULLEN TEXAS 2020 1.063918 516 485.0
18567 49009 DAGGETT UTAH 2020 1.650667 619 375.0

All of these counties have small populations, which makes it harder to estimate their population of citizens of voting age with a high degree of accuracy. There is also a temporal mismatch between the two datasets used to calculate voter turnout. The estimates for citizens of voting age are compiled from American Community Survey results averaged over a five-year period. The number of votes was calculated at a specific point in time, which may not correspond to the five-year average.

There are three main ways you could address the impossibly high voter turnout values for these counties:

  • Drop the records from the analysis.
  • Keep the records as they are.
  • Adjust the records to a plausible value.

Dropping the records means certain locations will have no data, making it harder to perform a spatial analysis. Keeping the current wrong values will reduce the quality of analysis results. Capping the voter turnout at 1 is a better measure of the actual voter turnout than the current values, especially since the small sizes of these counties indicate that the most likely source of the issue is an underestimate of the population of citizens of voting age.

You will adjust the voter_turnout value for each of these records to 1.

In [28]:
# Adjust values.
voting_info_df.loc[voting_info_df['voter_turnout'] > 1, 'voter_turnout'] = 1

# Describe data distribution.
voting_info_df['voter_turnout'].describe()
Out[28]:
count    12448.000000
mean         0.609503
std          0.098447
min          0.168894
25%          0.542592
50%          0.609886
75%          0.674621
max          1.000000
Name: voter_turnout, dtype: float64

Create the modeling dataset using the 2020 election as response¶

In the current voting_info_df table, each record represents a county and a year. To analyze voter turnout in the 2020 election, you need a dataset where each county is represented only once. One option would be to drop data from previous elections from the table. But previous election turnout values might be useful in predicting turnout for 2020.

You will pivot the table so that each county has one record and so that data for each election is in separate columns.

In [29]:
voting_info_pivot_df = voting_info_df.pivot(
    index=['FIPS'], 
    columns=['year'], 
    values=['totalvotes', 'cvap_est', 'voter_turnout', 'voter_turnout_dem', 'voter_turnout_gop', 'pctdiff_dem_vs_gop', 'rawdiff_dem_vs_gop', 'Winning Party'])

voting_info_pivot_df
Out[29]:
totalvotes cvap_est voter_turnout voter_turnout_dem voter_turnout_gop pctdiff_dem_vs_gop rawdiff_dem_vs_gop Winning Party
year 2008 2012 2016 2020 2008 2012 2016 2020 2008 2012 2016 2020 2008 2012 2016 2020 2008 2012 2016 2020 2008 2012 2016 2020 2008 2012 2016 2020 2008 2012 2016 2020
FIPS
01001 23641 23932 24973 27770 38010.0 40545.0 41305.0 43905.0 0.621968 0.590258 0.6046 0.632502 0.1603 0.156937 0.143711 0.170892 0.457853 0.428635 0.439947 0.451839 -0.478406 -0.460304 -0.489969 -0.444184 -11310 -11016 -12236 -12335 Republican Party Republican Party Republican Party Republican Party
01003 81413 85338 95215 109679 130865.0 144120.0 159195.0 174685.0 0.622114 0.592132 0.598103 0.627867 0.148137 0.127838 0.115946 0.140699 0.4682 0.458063 0.457822 0.478255 -0.514476 -0.557688 -0.571601 -0.537623 -41885 -47592 -54425 -58966 Republican Party Republican Party Republican Party Republican Party
01005 11630 11509 10469 10518 20900.0 20610.0 19990.0 19625.0 0.556459 0.558418 0.523712 0.535949 0.272584 0.286851 0.243672 0.245401 0.28067 0.269287 0.272836 0.286471 -0.014531 0.031454 -0.055688 -0.076631 -169 362 -583 -806 Republican Party Democratic Party Republican Party Republican Party
01007 8644 8420 8819 9595 17270.0 17500.0 17670.0 17680.0 0.500521 0.481143 0.499095 0.542704 0.133121 0.125829 0.106055 0.11233 0.362594 0.3504 0.381324 0.425622 -0.458468 -0.466746 -0.551536 -0.57728 -3963 -3930 -4864 -5539 Republican Party Republican Party Republican Party Republican Party
01009 24267 24006 25588 27588 40605.0 42115.0 42625.0 43635.0 0.597636 0.570011 0.600305 0.632245 0.086738 0.070521 0.050581 0.060502 0.50213 0.492865 0.536282 0.566311 -0.695059 -0.74094 -0.80909 -0.800022 -16867 -17787 -20703 -22071 Republican Party Republican Party Republican Party Republican Party
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
56037 16833 16895 17130 16698 29500.0 30820.0 30770.0 30055.0 0.57061 0.548183 0.556711 0.555581 0.195322 0.154899 0.105005 0.1272 0.351186 0.370798 0.394995 0.406887 -0.273154 -0.393844 -0.520899 -0.503414 -4598 -6654 -8923 -8406 Republican Party Republican Party Republican Party Republican Party
56039 12359 11464 12627 14787 15100.0 16115.0 16730.0 17275.0 0.818477 0.711387 0.754752 0.855977 0.494834 0.385541 0.437179 0.570072 0.302318 0.301458 0.234369 0.251288 0.235213 0.118196 0.26871 0.372422 2907 1355 3393 5507 Democratic Party Democratic Party Democratic Party Democratic Party
56041 8418 8539 8470 9459 14070.0 14455.0 14025.0 14205.0 0.598294 0.59073 0.603922 0.665892 0.164677 0.112625 0.085704 0.112003 0.409595 0.457627 0.438788 0.527702 -0.409361 -0.584026 -0.584652 -0.624273 -3446 -4987 -4952 -5905 Republican Party Republican Party Republican Party Republican Party
56043 4100 3944 3814 4032 6030.0 6030.0 6135.0 5995.0 0.679934 0.654063 0.621679 0.67256 0.172803 0.131675 0.086716 0.10859 0.490216 0.499834 0.474491 0.541284 -0.466829 -0.56288 -0.623755 -0.643353 -1914 -2220 -2379 -2594 Republican Party Republican Party Republican Party Republican Party
56045 3414 3359 3526 3560 5585.0 5510.0 5460.0 5415.0 0.61128 0.609619 0.645788 0.657433 0.117816 0.076588 0.054762 0.066482 0.468756 0.511978 0.555495 0.573777 -0.574107 -0.714201 -0.775383 -0.771629 -1960 -2399 -2734 -2747 Republican Party Republican Party Republican Party Republican Party

3113 rows × 32 columns

If any counties did not have records for a particular election, the pivoted data frame will show null values for those years. You will check whether any records have null values in any field.

In [30]:
voting_info_pivot_df[voting_info_pivot_df.isna().any(axis=1)]
Out[30]:
totalvotes cvap_est voter_turnout voter_turnout_dem voter_turnout_gop pctdiff_dem_vs_gop rawdiff_dem_vs_gop Winning Party
year 2008 2012 2016 2020 2008 2012 2016 2020 2008 2012 2016 2020 2008 2012 2016 2020 2008 2012 2016 2020 2008 2012 2016 2020 2008 2012 2016 2020 2008 2012 2016 2020
FIPS
46102 NaN NaN 2905 3200 NaN NaN 8980.0 8705.0 NaN NaN 0.323497 0.367605 NaN NaN 0.27951 0.324986 NaN NaN 0.026837 0.034118 NaN NaN 0.781067 0.79125 NaN NaN 2269 2532 NaN NaN Democratic Party Democratic Party
46113 3314 3145 NaN NaN 8035.0 8800.0 NaN NaN 0.412446 0.357386 NaN NaN 0.369757 0.33375 NaN NaN 0.041195 0.021364 NaN NaN 0.79662 0.874086 NaN NaN 2640 2749 NaN NaN Democratic Party Democratic Party NaN NaN

Additional research shows that Shannon County, SD (FIPS 46113), was renamed Oglala Lakota County in 2014 and assigned a new FIPS code (46102), which explains the missing data.

To facilitate analysis over time, you will replace the old FIPS code with the new one in the voting_info_df table and then repivot that dataframe.

In [31]:
# Make a copy of the data frame.
voting_info_fix_df = voting_info_df.copy()

# Fix the FIPS code for Oglala Lakota County.
voting_info_fix_df.loc[voting_info_fix_df['FIPS'] == '46113', 'FIPS'] = '46102'

# Pivot the table so that there is a single entry for each county.
voting_info_pivot_df = voting_info_fix_df.pivot(
    index=['FIPS'], 
    columns=['year'], 
    values=['totalvotes', 'cvap_est', 'voter_turnout', 'voter_turnout_dem', 'voter_turnout_gop', 'pctdiff_dem_vs_gop', 'rawdiff_dem_vs_gop', 'Winning Party'])

# Check the output to ensure there are no null values in any field.
voting_info_pivot_df[voting_info_pivot_df.isna().any(axis=1)]
Out[31]:
totalvotes cvap_est voter_turnout voter_turnout_dem voter_turnout_gop pctdiff_dem_vs_gop rawdiff_dem_vs_gop Winning Party
year 2008 2012 2016 2020 2008 2012 2016 2020 2008 2012 2016 2020 2008 2012 2016 2020 2008 2012 2016 2020 2008 2012 2016 2020 2008 2012 2016 2020 2008 2012 2016 2020
FIPS

The pivoted data frame has a multi-index for column names. You will flatten that multi-index and rename each column with the appropriate year.

In [32]:
voting_info_pivot_df.columns = [f"{a}_{b}" for a, b in voting_info_pivot_df.columns]
voting_info_pivot_df = voting_info_pivot_df.reset_index()
voting_info_pivot_df
Out[32]:
FIPS totalvotes_2008 totalvotes_2012 totalvotes_2016 totalvotes_2020 cvap_est_2008 cvap_est_2012 cvap_est_2016 cvap_est_2020 voter_turnout_2008 voter_turnout_2012 voter_turnout_2016 voter_turnout_2020 voter_turnout_dem_2008 voter_turnout_dem_2012 voter_turnout_dem_2016 voter_turnout_dem_2020 voter_turnout_gop_2008 voter_turnout_gop_2012 voter_turnout_gop_2016 voter_turnout_gop_2020 pctdiff_dem_vs_gop_2008 pctdiff_dem_vs_gop_2012 pctdiff_dem_vs_gop_2016 pctdiff_dem_vs_gop_2020 rawdiff_dem_vs_gop_2008 rawdiff_dem_vs_gop_2012 rawdiff_dem_vs_gop_2016 rawdiff_dem_vs_gop_2020 Winning Party_2008 Winning Party_2012 Winning Party_2016 Winning Party_2020
0 01001 23641 23932 24973 27770 38010.0 40545.0 41305.0 43905.0 0.621968 0.590258 0.6046 0.632502 0.1603 0.156937 0.143711 0.170892 0.457853 0.428635 0.439947 0.451839 -0.478406 -0.460304 -0.489969 -0.444184 -11310 -11016 -12236 -12335 Republican Party Republican Party Republican Party Republican Party
1 01003 81413 85338 95215 109679 130865.0 144120.0 159195.0 174685.0 0.622114 0.592132 0.598103 0.627867 0.148137 0.127838 0.115946 0.140699 0.4682 0.458063 0.457822 0.478255 -0.514476 -0.557688 -0.571601 -0.537623 -41885 -47592 -54425 -58966 Republican Party Republican Party Republican Party Republican Party
2 01005 11630 11509 10469 10518 20900.0 20610.0 19990.0 19625.0 0.556459 0.558418 0.523712 0.535949 0.272584 0.286851 0.243672 0.245401 0.28067 0.269287 0.272836 0.286471 -0.014531 0.031454 -0.055688 -0.076631 -169 362 -583 -806 Republican Party Democratic Party Republican Party Republican Party
3 01007 8644 8420 8819 9595 17270.0 17500.0 17670.0 17680.0 0.500521 0.481143 0.499095 0.542704 0.133121 0.125829 0.106055 0.11233 0.362594 0.3504 0.381324 0.425622 -0.458468 -0.466746 -0.551536 -0.57728 -3963 -3930 -4864 -5539 Republican Party Republican Party Republican Party Republican Party
4 01009 24267 24006 25588 27588 40605.0 42115.0 42625.0 43635.0 0.597636 0.570011 0.600305 0.632245 0.086738 0.070521 0.050581 0.060502 0.50213 0.492865 0.536282 0.566311 -0.695059 -0.74094 -0.80909 -0.800022 -16867 -17787 -20703 -22071 Republican Party Republican Party Republican Party Republican Party
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3107 56037 16833 16895 17130 16698 29500.0 30820.0 30770.0 30055.0 0.57061 0.548183 0.556711 0.555581 0.195322 0.154899 0.105005 0.1272 0.351186 0.370798 0.394995 0.406887 -0.273154 -0.393844 -0.520899 -0.503414 -4598 -6654 -8923 -8406 Republican Party Republican Party Republican Party Republican Party
3108 56039 12359 11464 12627 14787 15100.0 16115.0 16730.0 17275.0 0.818477 0.711387 0.754752 0.855977 0.494834 0.385541 0.437179 0.570072 0.302318 0.301458 0.234369 0.251288 0.235213 0.118196 0.26871 0.372422 2907 1355 3393 5507 Democratic Party Democratic Party Democratic Party Democratic Party
3109 56041 8418 8539 8470 9459 14070.0 14455.0 14025.0 14205.0 0.598294 0.59073 0.603922 0.665892 0.164677 0.112625 0.085704 0.112003 0.409595 0.457627 0.438788 0.527702 -0.409361 -0.584026 -0.584652 -0.624273 -3446 -4987 -4952 -5905 Republican Party Republican Party Republican Party Republican Party
3110 56043 4100 3944 3814 4032 6030.0 6030.0 6135.0 5995.0 0.679934 0.654063 0.621679 0.67256 0.172803 0.131675 0.086716 0.10859 0.490216 0.499834 0.474491 0.541284 -0.466829 -0.56288 -0.623755 -0.643353 -1914 -2220 -2379 -2594 Republican Party Republican Party Republican Party Republican Party
3111 56045 3414 3359 3526 3560 5585.0 5510.0 5460.0 5415.0 0.61128 0.609619 0.645788 0.657433 0.117816 0.076588 0.054762 0.066482 0.468756 0.511978 0.555495 0.573777 -0.574107 -0.714201 -0.775383 -0.771629 -1960 -2399 -2734 -2747 Republican Party Republican Party Republican Party Republican Party

3112 rows × 33 columns


Geoenable the data¶

To use this data in a spatial analysis, the data needs to include location information to determine where each county is located on a map. This step geocodes the data by joining it to existing county geometries.

There are various resources that you can use to find geoenabled data. ArcGIS Living Atlas of the World is an authoritative source provided by Esri. Each record in your election data represents information for a county, so you will use an ArcGIS Living Atlas dataset that represents county geometry.

ArcGIS API for Python, which you imported at the beginning of this notebook, extends the pandas data frame to allow spatially enabled data frames that can hold geometry. Converting the feature layer from ArcGIS Living Atlas to a spatially enabled data frame will allow you to join that geometry to the voting data.

Note: Executing the following cell may take a few minutes.

In [33]:
# The relevant layer is available to the public, so you can connect to ArcGIS Online anonymously.
gis = arcgis.gis.GIS()

# Get the USA Census counties layer from ArcGIS Living Atlas. 
item = gis.content.get('14c5450526a8430298b2fa74da12c2f4')

# Convert the layer to a spatially enabled data frame.
counties_sdf = pd.DataFrame.spatial.from_layer(item.layers[0])
counties_sdf.head()
Out[33]:
COUNTY_FIPS FIPS NAME OBJECTID POPULATION POP_SQMI SHAPE SQMI STATE_ABBR STATE_FIPS STATE_NAME Shape__Area Shape__Length
0 001 01001 Autauga County 1 58805 97.3 {"rings": [[[-86.413120727, 32.7073921370001],... 604.37 AL 01 Alabama 0.150256 2.066033
1 003 01003 Baldwin County 2 231767 141.9 {"rings": [[[-87.5649079999999, 30.281622], [-... 1633.14 AL 01 Alabama 0.398404 9.305629
2 005 01005 Barbour County 3 25223 27.9 {"rings": [[[-85.257838372, 32.147937056], [-8... 904.52 AL 01 Alabama 0.22327 2.69526
3 007 01007 Bibb County 4 22293 35.6 {"rings": [[[-87.06574294, 33.2469132270001], ... 626.17 AL 01 Alabama 0.156473 1.887519
4 009 01009 Blount County 5 59134 90.9 {"rings": [[[-86.453024823, 34.259323463], [-8... 650.63 AL 01 Alabama 0.164405 2.423466

The SHAPE field in the spatially enabled data frame represents the county geometry and is used to locate each record, or feature, on the map. You will join that geometry with the voting information you created earlier.

In [34]:
# Join the voting information dataframe with the counties geometry.
geo_sdf = pd.merge(counties_sdf, voting_info_pivot_df, left_on='FIPS', right_on='FIPS', how='right')
geo_sdf.head()
Out[34]:
COUNTY_FIPS FIPS NAME OBJECTID POPULATION POP_SQMI SHAPE SQMI STATE_ABBR STATE_FIPS STATE_NAME Shape__Area Shape__Length totalvotes_2008 totalvotes_2012 totalvotes_2016 totalvotes_2020 cvap_est_2008 cvap_est_2012 cvap_est_2016 cvap_est_2020 voter_turnout_2008 voter_turnout_2012 voter_turnout_2016 voter_turnout_2020 voter_turnout_dem_2008 voter_turnout_dem_2012 voter_turnout_dem_2016 voter_turnout_dem_2020 voter_turnout_gop_2008 voter_turnout_gop_2012 voter_turnout_gop_2016 voter_turnout_gop_2020 pctdiff_dem_vs_gop_2008 pctdiff_dem_vs_gop_2012 pctdiff_dem_vs_gop_2016 pctdiff_dem_vs_gop_2020 rawdiff_dem_vs_gop_2008 rawdiff_dem_vs_gop_2012 rawdiff_dem_vs_gop_2016 rawdiff_dem_vs_gop_2020 Winning Party_2008 Winning Party_2012 Winning Party_2016 Winning Party_2020
0 001 01001 Autauga County 1 58805 97.3 {'rings': [[[-86.413120727, 32.7073921370001],... 604.37 AL 01 Alabama 0.150256 2.066033 23641 23932 24973 27770 38010.0 40545.0 41305.0 43905.0 0.621968 0.590258 0.6046 0.632502 0.1603 0.156937 0.143711 0.170892 0.457853 0.428635 0.439947 0.451839 -0.478406 -0.460304 -0.489969 -0.444184 -11310 -11016 -12236 -12335 Republican Party Republican Party Republican Party Republican Party
1 003 01003 Baldwin County 2 231767 141.9 {'rings': [[[-87.5649079999999, 30.281622], [-... 1633.14 AL 01 Alabama 0.398404 9.305629 81413 85338 95215 109679 130865.0 144120.0 159195.0 174685.0 0.622114 0.592132 0.598103 0.627867 0.148137 0.127838 0.115946 0.140699 0.4682 0.458063 0.457822 0.478255 -0.514476 -0.557688 -0.571601 -0.537623 -41885 -47592 -54425 -58966 Republican Party Republican Party Republican Party Republican Party
2 005 01005 Barbour County 3 25223 27.9 {'rings': [[[-85.257838372, 32.147937056], [-8... 904.52 AL 01 Alabama 0.22327 2.69526 11630 11509 10469 10518 20900.0 20610.0 19990.0 19625.0 0.556459 0.558418 0.523712 0.535949 0.272584 0.286851 0.243672 0.245401 0.28067 0.269287 0.272836 0.286471 -0.014531 0.031454 -0.055688 -0.076631 -169 362 -583 -806 Republican Party Democratic Party Republican Party Republican Party
3 007 01007 Bibb County 4 22293 35.6 {'rings': [[[-87.06574294, 33.2469132270001], ... 626.17 AL 01 Alabama 0.156473 1.887519 8644 8420 8819 9595 17270.0 17500.0 17670.0 17680.0 0.500521 0.481143 0.499095 0.542704 0.133121 0.125829 0.106055 0.11233 0.362594 0.3504 0.381324 0.425622 -0.458468 -0.466746 -0.551536 -0.57728 -3963 -3930 -4864 -5539 Republican Party Republican Party Republican Party Republican Party
4 009 01009 Blount County 5 59134 90.9 {'rings': [[[-86.453024823, 34.259323463], [-8... 650.63 AL 01 Alabama 0.164405 2.423466 24267 24006 25588 27588 40605.0 42115.0 42625.0 43635.0 0.597636 0.570011 0.600305 0.632245 0.086738 0.070521 0.050581 0.060502 0.50213 0.492865 0.536282 0.566311 -0.695059 -0.74094 -0.80909 -0.800022 -16867 -17787 -20703 -22071 Republican Party Republican Party Republican Party Republican Party

The merge used a right join to ensure that every record in the election data is included in the output. However, with a right join, it is possible that election records for some counties were not joined to a geometry. To confirm that the join was successful, you will check to see whether any record has null geometry.

In [35]:
geo_sdf.query("SHAPE.isnull()")
Out[35]:
COUNTY_FIPS FIPS NAME OBJECTID POPULATION POP_SQMI SHAPE SQMI STATE_ABBR STATE_FIPS STATE_NAME Shape__Area Shape__Length totalvotes_2008 totalvotes_2012 totalvotes_2016 totalvotes_2020 cvap_est_2008 cvap_est_2012 cvap_est_2016 cvap_est_2020 voter_turnout_2008 voter_turnout_2012 voter_turnout_2016 voter_turnout_2020 voter_turnout_dem_2008 voter_turnout_dem_2012 voter_turnout_dem_2016 voter_turnout_dem_2020 voter_turnout_gop_2008 voter_turnout_gop_2012 voter_turnout_gop_2016 voter_turnout_gop_2020 pctdiff_dem_vs_gop_2008 pctdiff_dem_vs_gop_2012 pctdiff_dem_vs_gop_2016 pctdiff_dem_vs_gop_2020 rawdiff_dem_vs_gop_2008 rawdiff_dem_vs_gop_2012 rawdiff_dem_vs_gop_2016 rawdiff_dem_vs_gop_2020 Winning Party_2008 Winning Party_2012 Winning Party_2016 Winning Party_2020

Because no record has null geometry, you know that every voting record was joined to a county.

Convert data frames to feature classes¶

Define the ArcGIS Pro project, map, workspace, and database¶
In [36]:
# Create variables that represent the ArcGIS Pro project and map.
aprx = arcpy.mp.ArcGISProject("CURRENT")
m = aprx.listMaps('Data Engineering')[0]

# Create a variable that represents the default file geodatabase.
arcpy.env.workspace = aprx.defaultGeodatabase
arcpy.env.addOutputsToMap = False
In [37]:
arcpy.env.workspace
Out[37]:
'C:\\LocalData\\sp991\\GEOM183\\Practical1\\DataEngineering_and_Visualization\\DataEngineering_and_Visualization.gdb'

The pandas extension provided by the ArcGIS API for Python also enables exporting a spatially enabled data frame to a feature class.

Note: Executing the following cell may take a few minutes

In [38]:
# Create a feature class for the 2020 presidential election.
out_fc_name = "county_elections_pres"
out_fc_path = os.path.join(arcpy.env.workspace, out_fc_name)
out_fc = geo_sdf.spatial.to_featureclass(out_fc_path)
out_fc
Out[38]:
'C:\\LocalData\\sp991\\GEOM183\\Practical1\\DataEngineering_and_Visualization\\DataEngineering_and_Visualization.gdb\\county_elections_pres'

You used an ArcGIS notebook to complete the necessary data engineering tasks. Data engineering was needed to prepare this data for a predictive analysis that will model voter turnout using demographic variables, such as per capita income.

Return to the exercise and go to the Modify Environment Settings step for the remaining instructions.¶